ABSTRACT

Degree distribution models are incredibly important tools for 
analyzing and understanding the structure and formation of 
social networks, and can help guide the design of efficient 
graph algorithms. In particular, the Power-law degree distri- 
bution has long been used to model the structure of online 
social networks, and is the basis for algorithms and heuris- 
tics in graph applications such as influence maximization 
and social search. 

Along with recent measurement results, our interest in this 
topic was sparked by our own experimental results on so- 
cial graphs that deviated significantly from those predicted 
by a Power-law model. In this work, we seek a deeper un- 
derstanding of these deviations, and propose an alternative 
model with significant implications on graph algorithms and 
applications. We start by quantifying this artifact using a 
variety of real social graphs, and show that their structures 
cannot be accurately modeled using elementary distributions 
including the Power-law. Instead, we propose the Pareto- 
Lognormal (PLN) model, verify its goodness-of-fit using graph- 
ical and statistical methods, and present an analytical study 
of its asymptotical differences with the Power-law. To demon- 
strate the quantitative benefits of the PLN model, we com- 
pare the results of three wide-ranging graph applications on 
real social graphs against those on synthetic graphs gener- 
ated using the PLN and Power-law models. We show that 
synthetic graphs generated using PLN are much better pre- 
dictors of degree distributions in real graphs, and produce 
experimental results with errors that are orders-of-magnitude 
smaller than those produced by the Power-law model. 

1. INTRODUCTION 

Graph degree distributions are fundamental tools used in 
the study of complex networks such as online social net- 
works. Not only do they reveal insights into the structure 
and formation of these networks, but they also lay the foun- 
dation for modeling network dynamics and help guide the 
design of graph algorithms and applications. In particular, 
it is widely believed that the Power-law distribution accu- 
rately captures node degrees in complex networks such as 
online social networks, i.e. their degree distribution follows
a Power-law function $f(x) = c x^{-\alpha}$, for some normalization
constant $c$ and exponent $\alpha$. As a result, the Power-law
model has already played a significant role in guiding the 
design of algorithms on social network problems such as
influence maximization [9], landmark selection [28], and link
privacy protection [11].

Deviating from Power-law. Our interest in verifying the 
Power-law model for social graphs was first sparked by our 
own efforts to partition social graphs. In a recent project, 
we searched for an efficient way to partition and divide large 
social graphs across distributed machines for parallel graph 
computation. The goal is to minimize edges between parti- 
tions, thereby reducing data dependencies between machines 
when resolving graph queries. 

Given the known density of social graphs, it was not sur- 
prising when popular partitioning algorithms such as Metis 
produced poor partitions (those with low modularity) from 
our Facebook graphs [39]. Our next step was to leverage the
known fact that Power-law graphs are vulnerable to targeted 
attacks, i.e. they quickly fragment when their "supernodes" 
are removed HJ . We evaluated a new partitioning approach 
where a small portion of nodes with the highest degree are 
replicated across all machines. This allows us to essentially 
"remove" the supernodes from the graph, which should frag- 
ment the graph into numerous disconnected or weakly con- 
nected subgraphs, which are easily partitioned. 

Surprisingly, our results showed that unlike prior results 
on peer-to-peer networks [34], social graphs were extremely
resilient to this approach. On one of our typical Facebook 
graphs with 500K nodes, cutting the top 10% (50K) supern- 
odes had no impact on the connected graph. Nearly all of 
the remaining nodes (~440K nodes) were still connected in 
a strongly connected component. 
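
The supernode-removal experiment described above can be sketched in a few lines. This is a minimal illustration and not the code used in our measurements: the function name `largest_component_after_removal` and the edge-list input format are assumptions, and nodes without any edge are simply not represented.

```python
from collections import defaultdict


def largest_component_after_removal(edges, fraction):
    """Drop the top `fraction` of nodes by degree from an undirected graph.

    Returns (number of remaining nodes, size of the largest connected
    component among the remaining nodes).
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Rank nodes by degree, highest first, and mark the "supernodes".
    ranked = sorted(adj, key=lambda n: len(adj[n]), reverse=True)
    removed = set(ranked[:int(len(ranked) * fraction)])
    kept = [n for n in ranked if n not in removed]

    # Iterative depth-first search over the surviving nodes.
    seen, best = set(), 0
    for start in kept:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in removed and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return len(kept), best
```

On a star graph, removing the hub leaves only isolated nodes, which is the fragmentation a Power-law graph is expected to show; on a ring, removing one node leaves a single connected path.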

Revisiting degree models. These surprising results led 
us to re-examine how well social graphs obey the Power-law 
degree model. Recent measurements of popular online so- 
cial networks have shed light on their internal structure, and 
also produced a number of real social graphs suitable for 
model validation [39], [23], [38], [24]. Although the Power-law
exponent is widely used as a graph metric, it is increasingly
evident that the Power-law captures only a portion of the
degree distribution curve. While never explained, several results
consistently show that the Power-law distribution overpredicts
the number of "high-degree nodes" in social networks
[2], [14]. Similar divergence has also been observed in
Internet topology and web graphs or mobile cell graphs IZTl 

In this paper, we revisit the problem of modeling node 
degree distributions in online social networks. We question 
existing assumptions, propose the use of an alternative
distribution model, and show that choosing an appropriate degree
distribution model has a dramatic impact on a wide range
of OSN and graph applications. We are primarily interested
in answering three key questions. First, how significant is the
fitting error when modeling degree distributions using the 
Power-law and other elementary distributions? Second, can 
we propose an alternative distribution that provides a fit with 
significantly better accuracy? Finally, what is the real impact 
of switching to this alternative distribution from Power-law, 
and how does it impact real graph applications? 

Results and Contributions. In addressing the questions 
outlined above, our work makes several contributions to the 
problem of developing a more accurate degree distribution 
model for online social networks. 

First, we assert that the seminal Power-law distribution 
does not accurately model degree distributions in OSNs. We 
verify this hypothesis using a number of measured social 
graphs from the Facebook and Orkut OSNs, and show that 
other elementary distributions also perform poorly. Second,
we search for a more accurate model by examining complex 
distributions, and propose the use of the Pareto-Lognormal 
(PLN) distribution. Intuitively, the Power-law distribution 
forms when each new node joins the network via a bootstrap 
node hi, and builds edges using the "rich-get-richer" model 
in some graph neighborhood around hi. In contrast, the PLN 
distribution forms when each new node joins multiple com- 
munities and uses a stochastic process to drive connections 
within each community. To support PLN's use in analysis, 
we analytically describe its probability distribution function, 
cumulative distribution function, and maximum likelihood 
function. Using three different error measures, we com- 
pare the accuracy of PLN, Power-law and 5 other models 
on 7 different OSN social graphs ranging in size from 740K 
edges and 14K nodes, to 118 million edges and 1.6 million
nodes. Our results confirm that PLN provides the most accurate
degree distribution model.

Third, we highlight the differences between the PLN and 
Power-law models, quantifying the asymptotic differences 
between them when predicting high degree nodes in the network.
For both models, we derive a closed form to bound the
lowest degree of a node for any percentile of high-degree 
nodes in the network. We then make predictions of the car- 
dinality of nodes in that percentile. Using our social graph 
data, we validate the predictions from both models, and find 
that PLN generally produces errors that are at least two- 
orders-of-magnitude smaller than those of Power-law. 

Fourth, we examine the end-to-end impact of degree dis- 
tribution models on multiple graph applications, including 



graph partitioning, influence maximization [9], and attacks
on social graph link privacy [11]. Prior studies of these applications
used assumptions of Power-law graphs to drive
their algorithm design. We implement each application, and
show that running them on synthetic Power-law graphs produces
dramatically different results from those on real social
graphs. In contrast, we show that running applications
on synthetic PLN graphs produces results nearly identical to 
those produced using real social graphs. Finally, we give 
some preliminary intuition towards the ongoing design of a 
generative model that both captures the temporal evolution 
of social networks and produces degree distributions match- 
ing our PLN model. We draw insights from a series of daily 
snapshots of a Facebook social graph that capture dynamic 
growth over a period of a month. 

Social Graph Datasets. To evaluate our models, we use 
7 real social graphs gathered from recent measurements of 
popular online social networks. The majority of our datasets 
come from Facebook, the most popular OSN today with 
more than 500 million users. We use traces gathered through 
crawls of the Facebook network in 2009, when Facebook 
was structurally organized into geographical/regional networks.
We had crawled and analyzed over 10 million users
(~15% of Facebook in 2008), as part of an earlier measurement
study [39]. For this paper, we utilize 6 anonymized
social graphs representing a range of network sizes, from a
small Monterey, CA graph (13K nodes, 704K edges) to a
large London graph (1.6 million nodes, 118 million edges). 
We also include in our study the "Manhattan Random Walk" 
Facebook graph from [14], which has been shown to be a
representative uniform random sampling of the total Face- 
book graph. We use this graph to validate the representa- 
tiveness of our Facebook results. Finally, we also include a 
public social graph from the Orkut OSN [23]. With 3 million
nodes and 111 million edges, it has more nodes but fewer edges
than our largest Facebook graph (London). Our datasets and
their key statistics are summarized in Table 1.

Roadmap. We begin in Section 2 by examining how well
elementary functions fit degree distributions from real social
graphs. Next, in Section 3, we describe the PLN model
through its PDF, CDF and maximum likelihood functions,
and show the accuracy of this model through statistical analysis.
Then, we quantify asymptotic differences between the
models in Section 4, present experimental application-level
results in Section 5, and discuss intuition for a generative
model in Section 6. Finally, we discuss related work in Section 7
and conclude in Section 8.

2. ELEMENTARY DISTRIBUTIONS 

We start by examining how well elementary distribution 
models fit real social graphs from deployed OSNs. We include
in our analysis three elementary distributions: the Power-law,
Lognormal and Exponential distributions [25], [11]. We
leave out the formal introduction of these models, and instead
focus on presenting the results of our experimental
analysis.

Fitting Models to Real Datasets. We fit each distribution
model to our real OSN datasets, using an optimal estimator
to derive the best model parameters [11].

For each considered distribution model, we analyze the 
quality of the fit on our real data through three probability 
plots: the probability distribution, the complementary cumu- 
lative distribution (CCDF), and the "quantile-quantile" plot. 

The probability distribution function is an important met- 
ric for understanding the portion of nodes that have a certain 
degree value. For example, the seminal Power-law model 
predicts that a large number of nodes have a high degree. 
The CCDF quantifies how often a node's degree is above 
a value. It is particularly useful for identifying the general 
slope of OSN connectivity. Finally, the Quantile-Quantile 
analysis can graphically compare different distributions, in 
our case a theoretical model and a real social graph. The 
plot shows the discrepancy between degree values that corre- 
spond to the same quantile in both distributions. The greater 
the distance from the reference line, the stronger the evi- 
dence that the dataset follows a different distribution. 
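
As a small illustration of the CCDF used in these plots, the following sketch computes an empirical CCDF from a list of node degrees. The function name `empirical_ccdf` is hypothetical, and the sketch assumes the convention P[degree >= d]; a strict inequality would shift each value by one step.

```python
from collections import Counter


def empirical_ccdf(degrees):
    """Return sorted (d, P[degree >= d]) pairs for each distinct degree d."""
    n = len(degrees)
    counts = Counter(degrees)
    ccdf = []
    remaining = n  # number of nodes with degree >= current d
    for d in sorted(counts):
        ccdf.append((d, remaining / n))
        remaining -= counts[d]
    return ccdf
```

Plotting these pairs on log-log axes gives the kind of curve against which the fitted models are compared.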

Experimental Results. We perform experiments on our 
6 Facebook datasets, the random walk Facebook graph, and 
the Orkut dataset. Results across all graphs are highly consistent,
so for brevity we show only 3 graphs: the Santa Barbara,
Los Angeles and London Facebook networks. These social
graphs range in size from 13K to 1.6M nodes.

Figure 1 plots the CCDF of the fitted models on the three
datasets. The CCDF presents a detailed view on the tail of 
these distributions. We see that none of the presented mod- 
els accurately captures the decay slope shown by the real 
datasets. Both the Power-law and the Lognormal distributions
overestimate the tail. Take London for example (Figure 1).
The Power-law model overestimates the highest degree
nodes by up to 5 orders of magnitude compared to the
real graph. In contrast, the Exponential distribution decays 
sharply, and significantly underestimates the number of high 
degree nodes. On the London graph, this error reaches 4 
orders of magnitude for nodes with degrees of 2000. 

The quantile-quantile plot formally quantifies the distance 
between the real data and the model. Figure 2 shows that
both the Power-law and the Lognormal model underestimate
nearly 90% of nodes (i.e. Figure 2, x-axis $\in [1, 500]$) and
largely overestimate the very high degree nodes. In con- 
trast, the Exponential distribution model exhibits a reason- 
ably good behavior along the lower tail and the body of the 
distribution compared to the dataset, although a strongly di- 
verging slope is displayed on the high degree nodes. 

We do not include plots on the comparison of probability 
distribution functions because of space constraints. However,
they lead to the same conclusion: the high density of
nodes with low degree is not well predicted by the Power-law
and Lognormal models, and all models fail to accurately



predict the number of high degree nodes. Finally, while 
this section only reports results from graphical assessment 
of goodness-of-fit, we confirm our observations using statistical
error measures later in Section 3.3.

3. COMPLEX DISTRIBUTION MODELS 

Our experimental results show that none of the elementary 
distributions, including the Power-law distribution, provide 
a satisfactory fit for degree distributions in today's OSNs. 
These results, however, do lead to an intuition where low de- 
grees are distributed following a Pareto model while higher 
degrees can be modeled with an asymptotically faster decay- 
ing distribution. We believe a combination of two models 
will offer a better representation of the complex phenomena 
observed on the OSNs. A similar approach has been used
successfully in other fields, for example to model actuarial
data in economics [29].

In this section, we test our intuitions by evaluating four 
distribution models with a Pareto component, each with vary- 
ing degrees of accuracy and fitting complexity. Ultimately, 
we confirm via both graphical and statistical analysis that 
the beginning of these distributions is characterizable as a 
Pareto and the upper tail decays as a Lognormal, and that 
the Pareto-Lognormal distribution provides the best combi- 
nation of accuracy and fitting overhead. 

3.1 Pareto with Exponential Cutoff 

Prior work has proposed modeling human behavior using 
the Pareto with Exponential cutoff distribution [36], where
the beginning of the distribution follows a Pareto and the 
tail exponentially decays. This formulation is particularly 
interesting to our study given the decay observed in the tail 
of our data (see Figure 1). We study this distribution using
the derivations from lITSi . 

Fitting to real datasets. Figure 3 shows how well the
Pareto with Exponential cutoff distribution fits the Santa Barbara,
Los Angeles and London datasets. We plot the CCDF
to graphically show how the tail of this distribution decays
compared to real datasets. While this model is able to closely
fit the low degree nodes of the real data, it falls short in capturing
the upper tail by underestimating the density of nodes
with high degrees. We conclude that the exponential cutoff 
is too sharp to properly capture OSN connectivity of central 
(high-degree) nodes. Intuitively, we are looking for a more 
gradual decay slope. 

3.2 Pareto-Lognormal Distribution 

Given the low accuracy in modeling the upper tail of our 
data with the Pareto with exponential cutoff, we turn our at- 
tention to a family of distributions that mixes Pareto with 
Lognormal distributions. These models, compared to the 
Pareto with exponential cutoff model, provide a smoother decay
on the upper tail that better matches our real
data. We start with the Double Pareto-Lognormal (DPLN)
distribution introduced by Reed in [32]. The DPLN is a



[Figure 1 plots: CCDF versus Node Degree for the fitted elementary models, including Power Law.]

Figure 1: Complementary cumulative distribution function of elementary models fitted on Santa Barbara, Los Angeles and London Facebook
graphs.

Figure 2: Quantile-Quantile plot of elementary distributions fitted on the London dataset to graphically show the similarity of real social network
data to a particular distribution model.

Figure 3: The CCDF of the Pareto with Exponential Cutoff fitted on Santa Barbara, Los Angeles and London networks.



complex model with the ability to fuse two Pareto and a Lognormal
distribution. It includes four parameters: two Pareto
exponents $\alpha$ and $\beta$ that identify the slope of the upper and
lower tails of the distribution, and $\mu$ and $\tau$ that describe the
Lognormal parameters connecting the two Pareto tails. The
DPLN also gives rise to two other distributions [32]: the
Pareto-Lognormal (PLN) and the Lognormal-Pareto (LNP).
Both are expressed by three parameters: a Pareto exponent
and the Lognormal components $\mu$ and $\tau$.

Next, we derive the precise formulation of the PLN dis- 
tribution in terms of the PDF, CDF and the likelihood func- 
tion. We omit the detailed derivation on DPLN and LNP for 
brevity. In Sections 3.2 and 3.3 we use both graphical and
statistical assessments of goodness-of-fit to compare these 
three distributions. 

Pareto-Lognormal PDF and CDF Derivation. The
PLN is expressed as the combination of two probability distributions.
We derived the formula of this distribution (which
is a limit form of the DPLN) using results on the DPLN [32].
The Pareto-Lognormal probability distribution function is:

$$f(x) = \beta x^{\beta-1} e^{-\beta\mu + \beta^2\tau^2/2}\, \Phi^c\!\left(\frac{\log x - \mu + \beta\tau^2}{\tau}\right) \qquad (1)$$

where $\beta$ characterizes the slope of the lower tail of this distribution,
which follows a Pareto behavior, and $\mu$ and $\tau$ characterize
the body and the upper tail of this distribution, which
approximate a Lognormal decline. We also derived the Pareto-Lognormal
cumulative distribution function, formalized as
follows:

$$F(x) = \Phi\!\left(\frac{\log x - \mu}{\tau}\right) + x^{\beta} e^{-\beta\mu + \beta^2\tau^2/2}\, \Phi^c\!\left(\frac{\log x - \mu + \beta\tau^2}{\tau}\right) \qquad (2)$$

with $E[\log X] = \mu - 1/\beta$ and $VAR[\log X] = \tau^2 + 1/\beta^2$. In Section 4
we will analyze the CDF and formally prove that the Pareto
component describes the low values of the distribution and
the Lognormal one dominates the high values.
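
The PDF in Equation 1 and the CDF in (2) can be evaluated with the standard library alone. The sketch below is illustrative only: the function names `pln_pdf` and `pln_cdf` and the parameter values in the usage note are assumptions, and no care is taken to avoid overflow for very large x.

```python
import math


def phi(z):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def phi_c(z):
    """Complementary standard normal CDF, 1 - phi(z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def pln_pdf(x, beta, mu, tau):
    """Pareto-Lognormal density at x > 0."""
    a = math.exp(-beta * mu + 0.5 * beta ** 2 * tau ** 2)
    arg = (math.log(x) - mu + beta * tau ** 2) / tau
    return beta * x ** (beta - 1.0) * a * phi_c(arg)


def pln_cdf(x, beta, mu, tau):
    """Pareto-Lognormal cumulative distribution at x > 0."""
    a = math.exp(-beta * mu + 0.5 * beta ** 2 * tau ** 2)
    z = (math.log(x) - mu) / tau
    return phi(z) + x ** beta * a * phi_c(z + beta * tau)
```

A quick consistency check is that a finite-difference derivative of `pln_cdf` agrees with `pln_pdf`, for instance at x = 20 with the hypothetical values beta = 1.5, mu = 3 and tau = 1.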

Pareto-Lognormal Likelihood Derivation. The likelihood
function estimates the parameters of a distribution function
in light of the observed data. The likelihood $L$ is a
function of the parameters $\theta$ of the distribution and is defined



[Figure 4 plot: reverse log likelihood (y-axis, roughly 9.56e+06 to 9.72e+06) as a function of $\beta$.]

[Figure 6 plot: Pareto-Lognormal versus London, x-axis Theoretical Quantiles, logarithmic axes from 10 to 10000.]

Figure 4: The $\beta$ value corresponding to the minimum reverse log
likelihood value maximizes the likelihood of the Pareto-Lognormal distribution
to fit the London sample.

as follows:

$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

Using the definition of the Pareto-Lognormal distribution in
Equation 1, the likelihood becomes:

$$L(\beta, \mu, \tau) = \prod_{i=1}^{n} \beta x_i^{\beta-1} e^{-\beta\mu + \beta^2\tau^2/2}\, \Phi^c\!\left(\frac{\log x_i - \mu + \beta\tau^2}{\tau}\right)$$

The likelihood is defined as a product, and maximizing a
product is usually more difficult than maximizing a sum. Instead,
we use a monotonically increasing transformation
to turn the function $L(\beta, \mu, \tau)$ into a new function
$L'(\beta, \mu, \tau)$, such that $L(\beta, \mu, \tau)$ and $L'(\beta, \mu, \tau)$ have
their maximum values for the same $\beta$, $\mu$ and $\tau$ values. The
transformation we use is the logarithmic function,
which turns the maximization of a product into an easier
maximization of a sum. For simplicity, let $A_0$ be $-\beta\mu + \beta^2\tau^2/2$;
then

$$\log L = \log \prod_{i=1}^{n} f(x_i) = \sum_{i=1}^{n} \log f(x_i) = n \log \beta + (\beta - 1) \sum_{i=1}^{n} \log x_i + n A_0 + \sum_{i=1}^{n} \log \Phi^c\!\left(\frac{\log x_i - \mu + \beta\tau^2}{\tau}\right) \qquad (3)$$

We use (3) to estimate the parameters $\beta$, $\mu$ and $\tau$ in order
to fit the Pareto-Lognormal model to our real OSN graphs.
We also reverse the sign of (3) such that the likelihood to fit
the data is maximized by minimizing $-L'(\beta, \mu, \tau)$.
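
The reverse log likelihood can be written directly from the sum above. The following is a minimal sketch under the assumption that all observations are strictly positive; the name `pln_neg_log_likelihood` is hypothetical, and the sketch does not guard against the tail term underflowing to zero for extreme parameter values.

```python
import math


def pln_neg_log_likelihood(xs, beta, mu, tau):
    """Reverse (negative) log likelihood of the Pareto-Lognormal model.

    xs: observed values (e.g. node degrees), all > 0.
    """
    n = len(xs)
    a0 = -beta * mu + 0.5 * beta ** 2 * tau ** 2
    logs = [math.log(x) for x in xs]
    # n log(beta) + (beta - 1) * sum(log x_i) + n * A_0
    total = n * math.log(beta) + (beta - 1.0) * sum(logs) + n * a0
    # + sum of log Phi^c((log x_i - mu + beta tau^2) / tau)
    for lx in logs:
        arg = (lx - mu + beta * tau ** 2) / tau
        total += math.log(0.5 * math.erfc(arg / math.sqrt(2.0)))
    return -total
```

Smaller return values indicate a better fit, matching the convention of minimizing the reverse log likelihood.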

Parameter estimation. The Pareto-Lognormal distribution
has three parameters (i.e. $\beta$, $\mu$ and $\tau$) to estimate in
order to fit real data. We perform a grid search, i.e. a multidimensional
numerical search over the parameter space, to
identify the best triplet of values to fit a particular dataset, as 
in 



Figure 6: The Pareto-Lognormal Quantile-Quantile plot on London 
shows that this model almost perfectly approximates the real data. 

We bound the search of the parameters $\mu$ and $\tau$ using the
Moment Method Estimation to determine the initial values, 
and then refine the search around those values to identify the 
best ones. While this methodology could lead to suboptimal 
results that lower the performance of our models when com- 
pared to those using optimal parameter configurations, we 
choose it because of its computational efficiency. 
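
A grid search of this kind can be sketched as follows. This is not our actual implementation: the helper names, the grid offsets and the clamping constant are hypothetical, and the moment relations used for the starting point (mean of log x equal to $\mu - 1/\beta$, variance of log x equal to $\tau^2 + 1/\beta^2$) are an assumption of this sketch.

```python
import math


def pln_nll(logs, beta, mu, tau):
    """Reverse log likelihood of PLN given log-observations `logs`."""
    a0 = -beta * mu + 0.5 * beta ** 2 * tau ** 2
    total = len(logs) * (math.log(beta) + a0) + (beta - 1.0) * sum(logs)
    for lx in logs:
        tail = 0.5 * math.erfc((lx - mu + beta * tau ** 2) / (tau * math.sqrt(2.0)))
        if tail <= 0.0:  # numerical underflow: treat candidate as infeasible
            return math.inf
        total += math.log(tail)
    return -total


def fit_pln(degrees, betas,
            mu_offsets=(-0.5, -0.25, 0.0, 0.25, 0.5),
            tau_scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Grid search over (beta, mu, tau); returns (best_nll, (beta, mu, tau))."""
    logs = [math.log(d) for d in degrees]
    mean = sum(logs) / len(logs)
    var = sum((l - mean) ** 2 for l in logs) / len(logs)
    best_nll, best_params = math.inf, None
    for beta in betas:
        # Moment-based starting point for mu and tau at this beta.
        mu0 = mean + 1.0 / beta
        tau0 = math.sqrt(max(var - 1.0 / beta ** 2, 1e-2))
        # Refine around the starting point.
        for dm in mu_offsets:
            for st in tau_scales:
                params = (beta, mu0 + dm, tau0 * st)
                nll = pln_nll(logs, *params)
                if nll < best_nll:
                    best_nll, best_params = nll, params
    return best_nll, best_params
```

The cost grows with the product of the grid sizes, which is why bounding the search around a moment-based starting point keeps the fit cheap at the risk of missing the global optimum.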

We use the likelihood metric as the objective function in 
our parameter search IQ, i.e. our goal is to minimize the 
reverse log likelihood $-L'(\beta, \mu, \tau)$ when searching for the
optimal $(\beta, \mu, \tau)$ [6]. To show that there is a clear convex
trend around the minimum, we plot in Figure 4 the values of
$-L'(\beta, \mu, \tau)$ for the London dataset as a function of $\beta$. We
have identified similar trends in all our datasets.

Fitting to real datasets. Figure 5 examines the results of
fitting the PLN, DPLN and LNP distributions to our graphs
as CCDF plots. Among these three models, LNP overestimates
the data on the upper tail, which suggests that the
decay of the right tail follows more of a Lognormal model
than a Pareto one. The flexibility of the DPLN model, with
its four parameters, allows for a very good fit. However,
it comes at the cost of a higher fitting complexity, which
grows exponentially in the number of parameters. On the
other hand, the PLN distribution achieves the accuracy of
the DPLN, and also has the reduced fitting complexity of
LNP (i.e. 3 parameters instead of 4). Figure 5 clearly shows
that on all three datasets (i.e. Santa Barbara, Los Angeles
and London), PLN and DPLN overlap along the entire distribution.
In addition, we use the Quantile-Quantile plot to
evaluate the fitting accuracy of PLN, and show the results for
London in Figure 6. This plot is representative and others are
omitted for brevity. It shows a near perfect fit of the
Pareto-Lognormal to the real London graph.

3.3 Statistical Analysis 

The graphical assessments in Section 2 (i.e. the PDF,
CCDF and the Quantile-Quantile plot) are the first step towards
a complete characterization of these distributions. We
now look at statistical measures to quantify how well each
model fits real social graphs [10].
Figure 5: The CCDF of the DPLN and its two limit forms, i.e. Pareto-Lognormal (PLN) and Lognormal-Pareto (LNP), fitted on Santa Barbara,
Los Angeles and London networks. The PLN provides the same accurate fit as the DPLN but with much lower fitting complexity.

...       Log L. 9.56626e+06    AIC 1.91325e+07    RSS 2.12646e-01

Orkut (nodes 3 M, edges ≈ 111 M)
Log L.    1.81346e+07  1.63179e+07  1.62442e+07  1.61573e+07  1.61822e+07  1.60868e+07  1.60829e+07
AIC       3.62692e+07  3.26358e+07  3.24885e+07  3.23147e+07  3.23645e+07  3.21738e+07  3.21659e+07
RSS       1.94280e+03  2.41064e+00  3.75274e-01  2.95699e-01  4.16978e-01  1.60096e-02  1.01767e-02

Table 1: Quantifying the "Goodness of the fit" of 7 distribution models via statistical methods on 6 Facebook datasets crawled in 2009, and on the
Orkut dataset.



Goodness-of-Fit Analysis. We consider three measures 
to evaluate the goodness of the proposed statistical models, 
including the likelihood function, the Akaike's Information 
Criterion and the Residual Sum of Squares. 

The likelihood function is used to quantify the likelihood 
that a particular model fits a given dataset. Section 3.2 explained
how to estimate this function for PLN by computing
the minimum of the reverse log likelihood. We compute the 
minimum of the reverse of the log likelihood function for 
each model such that the best fit models generate the lowest 
values. This matches the two other measures, where the best 
models also generate the smallest fitting errors. 

Akaike's Information Criterion (AIC) [3] is a measure of
the quality of the fit that is capable of capturing the tradeoff
between accuracy and fitting costs. The AIC value is computed
as $AIC = 2k - 2 \log L$, where $k$ is the number of
parameters in the statistical model, and L is the maximized 
value of the likelihood function for the estimated model. The 



value k in the AIC test is used to trade off the accuracy of a
model with its complexity. The model that shows the lowest 
AIC is considered the one that best fits the data. 

Residual Sum of Squares (RSS) [35] is a statistical method
that computes the sum of squares of residuals between the
fitted distribution and the data sample. It measures the
discrepancy as a squared Euclidean distance between the data and the
estimation model. A small RSS indicates a tight fit of the
model to the data, and it can be formally expressed as
$RSS = \sum_{i} (y_i - f(x_i))^2$, where $y_i$ is the empirical value and
$f(x_i)$ is the estimated value of the statistical model.
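
Both measures are short to compute once a model has been fitted. The sketch below uses hypothetical function names `aic` and `rss`; `log_likelihood` is the maximized log likelihood (the negative of the reverse log likelihood reported in our tables).

```python
def aic(k, log_likelihood):
    """Akaike's Information Criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_likelihood


def rss(observed, predicted):
    """Residual sum of squares between empirical and model values."""
    return sum((y - f) ** 2 for y, f in zip(observed, predicted))
```

For log likelihoods on the order of 10^7, as in our datasets, the 2k term changes the AIC only marginally, so the AIC ranking closely follows the likelihood ranking unless two models are nearly tied.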

Error Measures on Real Traces. We now compare mod- 
els based on the numerical values of these three statistical 
methods computed on our OSN datasets. We explore sam- 
ple size variation to prove that the datasets manifest the same 
trend across different sample sizes. 

In Table 1 we report a set of statistical values to quantify
the goodness-of-fit for each of the seven analyzed models.
We highlight in bold the smallest values (i.e. those that
identify the best model) for each metric.

Across all datasets, the Power-law model consistently performs
the worst. Its values of Log Likelihood and AIC test
are the highest and the RSS values are up to 4 orders of
magnitude higher than the best model. The second worst
model on our measured datasets is the Lognormal. Its RSS
values are up to 3 orders of magnitude larger than the best
model, due to the high imprecision in estimating the high
degree nodes. Exponential, Pareto with Exponential cutoff
and LNP provide reasonable accuracy in the sense that their values
are within 1 order of magnitude of the best.

The two best models are DPLN and PLN. The RSSs for
some datasets identify DPLN as the best model, with a very
small difference separating it from PLN (on the second or
third decimal point). On the other side, the AIC test slightly
penalizes the model with more parameters, such as DPLN.
Based on the results of both the likelihood and the AIC, we
see that PLN does consistently well on all our Facebook
datasets. Finally, analysis of the Orkut graph produces results
consistent with our Facebook graphs.

Overall we see a consistent trend: Power-law does not
produce accurate results, and the best models are PLN and
DPLN. Only in a small number of cases is DPLN slightly
more accurate than PLN, and the differences are exceptionally
small compared to other models. Given the significant increase
in fitting complexity for DPLN (i.e. deriving 4 rather
than 3 parameters), the PLN model clearly produces the best
combination of fitting accuracy and complexity.

4. IMPLICATIONS OF THE PLN MODEL 

We have shown that the PLN model provides a much more
accurate fit to today's OSN graphs. But are these differences
large enough to really matter? In this section, we answer
this question by quantifying the magnitude of error introduced
by the Power-law model, and in the process, show
that social algorithms and protocols based on the Power-law
assumption must be re-evaluated (and some re-designed) using
the PLN model.

The remainder of this section includes three parts. First,
we analyze PLN using its CDF to characterize the asymptotic
slope of its tail, which we use later to derive more complete
bounds, in order to better understand the high degree
nodes of these networks. Second, for both the PLN and Power-law
models, we analytically determine a closed form to bound
the lowest degree of high degree nodes. By comparing these
bounds, we formally quantify the divergence of Power-law
from PLN. We also validate our analytical results through empirical
analysis on our Facebook and Orkut graphs. Finally,
for both models, we approximate the cardinality of high degree
nodes (those with degrees in the top 10% of the network),
and evaluate the discrepancy between the two. Again,
we validate our analytical results using our real datasets to
understand the actual prediction errors from both models.



4.1 Modeling High Degree Nodes with PLN 

The cardinality of high degree nodes is a key factor in designing
social applications and protocols. To characterize
this distribution in the PLN model, it is necessary to understand
the limit behavior of its CDF (or $F(x)$), defined by
(2). We aim to bound the limit behavior so that we can directly
derive the expected number of high degree nodes and
their connectivity. Let $z = \frac{\log x - \mu}{\tau}$ and $A = e^{-\beta\mu + \beta^2\tau^2/2}$; we
have

$$F(x) = \Phi(z) + x^{\beta} A\, \Phi^c(z + \beta\tau) \qquad (4)$$

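Leveraging the erf function (i.e. the Gauss error function,
which is a special function of sigmoid shape), we can express
the standard normal cumulative distribution $\Phi(x)$ and
its complementary form $\Phi^c(x)$ respectively as follows:

$$\Phi(x) = \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right] = \frac{1}{2}\,\mathrm{erfc}\!\left(-\frac{x}{\sqrt{2}}\right) \qquad (5)$$

$$\Phi^c(x) = 1 - \frac{1}{2}\left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right] = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{x}{\sqrt{2}}\right) \qquad (6)$$

with $x \in \mathbb{R}$. Note that by definition, $\mathrm{erf}(-x) = -\mathrm{erf}(x)$
and $\mathrm{erfc}(x) = 1 - \mathrm{erf}(x)$. We can now reformulate (4)
through the use of Gauss error functions as:

$$F(x) = \frac{1}{2}\,\mathrm{erfc}\!\left(-\frac{z}{\sqrt{2}}\right) + \frac{x^{\beta} A}{2}\,\mathrm{erfc}\!\left(\frac{z + \beta\tau}{\sqrt{2}}\right) \qquad (7)$$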

Next we use the asymptotic expansion of the Gauss error
function to further expand $F(x)$. For large $x$, the
$\mathrm{erfc}()$ function can be approximated with the
following series:

$$\mathrm{erfc}(x) \approx \frac{e^{-x^2}}{x\sqrt{\pi}} \sum_{n=0}^{+\infty} (-1)^n \frac{(2n)!}{n!\,(2x)^{2n}} \qquad (8)$$

In the remainder of this analysis, we use only the first
term of the series in (8), because it achieves enough
accuracy to accomplish the goals of this investigation.
While we omit a formal error analysis of the loss in the
approximation, we will show via experimental analysis that
the loss is negligible, and the results are sufficiently
accurate. Thus the $\mathrm{erfc}()$ function is
approximated as follows:

$$\mathrm{erfc}(x) \approx \frac{e^{-x^2}}{x\sqrt{\pi}} \qquad (9)$$
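To give a feel for the size of the loss when the series is truncated to its first term, the sketch below compares that term against the exact complementary error function from Python's standard library. The helper names are ours and the sample points are arbitrary.

```python
import math


def erfc_first_term(x):
    """First term of the asymptotic series: exp(-x^2) / (x * sqrt(pi))."""
    return math.exp(-x * x) / (x * math.sqrt(math.pi))


def relative_error(x):
    """Relative gap between the one-term approximation and math.erfc."""
    exact = math.erfc(x)
    return abs(erfc_first_term(x) - exact) / exact


if __name__ == "__main__":
    for x in (1.0, 2.0, 3.0, 5.0):
        print(f"x = {x}: relative error = {relative_error(x):.4f}")
```

Since the next term of the series is $-1/(2x^2)$ relative to the first, the one-term form overestimates $\mathrm{erfc}(x)$ and the gap shrinks roughly quadratically as $x$ grows.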

Lemma 1. For a sufficiently large $x$,
$1 - (\Phi(z) + x^{\beta} A \, \Phi^{c}(z + \beta\tau))$ goes to
$\Phi^{c}\left(\frac{\log x - \mu}{\tau}\right)$.

The claim states that for large $x$ the PLN distribution has
a Lognormal form with the parameters of the PLN model.
We will use this Lemma extensively in the following
discussions.

4.2 Quantile Analysis: a Degree Threshold 

Next, we quantify how well the PLN and the Power-law
models predict the degree threshold that defines the top
$\gamma$ subset of the highest degree nodes in the network.
In other words, we wish to predict the minimum degree
$\xi_\gamma$ such that nodes with degree $> \xi_\gamma$ are
in the top $\gamma$ portion of all nodes sorted by degree.
Formally, $\xi_\gamma$ is the $\gamma$-th quantile of the
complementary cumulative distribution. As a concrete
example, we will use $\gamma = 0.10$, i.e. top 10%.
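The empirical counterpart of this threshold can be read directly off a sorted degree list. The following sketch shows one way to do so; the function name and the rounding rule for the size of the top group are our own choices.

```python
def degree_threshold(degrees, gamma=0.10):
    """Return the smallest degree among the top `gamma` fraction of nodes."""
    if not degrees:
        raise ValueError("degrees must not be empty")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    ranked = sorted(degrees, reverse=True)
    # Size of the top group, with at least one node in it.
    top_count = max(1, round(gamma * len(ranked)))
    return ranked[top_count - 1]
```

For a toy network whose degrees are 1 through 100, the top 10% are the nodes with degrees 91 to 100, so the threshold is 91.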

We compute $\xi_\gamma$ for both models. For the Power-law
model, the degree $\xi_\gamma$ can be expressed in closed
form as a power of $1/\gamma$ whose exponent is determined
by the Power-law fit. To quantify the same quantile for the
PLN distribution, we again leverage its limit behavior. We
then use the asymptotic expansion of the complementary Gauss
error function to obtain a tight upper bound in Lemma 2.



Lemma 2. A tight upper bound of the $\gamma$-th quantile of
the PLN distribution is

$$\xi_\gamma = e^{\mu + \sqrt{-2\tau^2 \log(\sqrt{2\pi}\,\gamma)}}$$

Proof. Given the result in Lemma 1, we consider
$1 - F(x) \approx \Phi^{c}\left(\frac{\log x - \mu}{\tau}\right)$
for large values of $x$. Since we are using the
Pareto-Lognormal parameters, we can reformulate the
complementary cumulative distribution function for large
$x = \xi_\gamma$ values as:

$$\frac{e^{-(\log \xi_\gamma - \mu)^2/(2\tau^2)}}{2\sqrt{\pi}\,\frac{\log \xi_\gamma - \mu}{\sqrt{2}\,\tau}}$$

In order to derive the minimum degree $\xi_\gamma$ which
characterizes the high degree nodes, we first quantify the
top $\gamma$ fraction of high degree nodes in the network,
and then derive the required degree value. We leverage the
tail of the complementary cumulative distribution from
Lemma 1 and thus, the following holds:

$$\frac{e^{-(\log \xi_\gamma - \mu)^2/(2\tau^2)}}{2\sqrt{\pi}\,\frac{\log \xi_\gamma - \mu}{\sqrt{2}\,\tau}} = \gamma \qquad (10)$$

Let $y = \log \xi_\gamma - \mu$, then (10) becomes
$\frac{\tau\, e^{-y^2/(2\tau^2)}}{\sqrt{2\pi}\, y} = \gamma$.
Applying the logarithmic function and approximating with
the quadratic term, it reduces to
$y = \sqrt{-2\tau^2 \log(\sqrt{2\pi}\,\gamma)}$.
By substituting $y$, we have
$\log \xi_\gamma - \mu = \sqrt{-2\tau^2 \log(\sqrt{2\pi}\,\gamma)}$,
or $\xi_\gamma = e^{\mu + \sqrt{-2\tau^2 \log(\sqrt{2\pi}\,\gamma)}}$,
which is the minimum degree of the high degree nodes in the
$\gamma$-th quantile. □
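The bound in Lemma 2 can be checked numerically against the exact upper-tail quantile of a Lognormal distribution. The sketch below does this under two assumptions of ours: the constant inside the logarithm is taken as $\sqrt{2\pi}\,\gamma$, and any $\mu$, $\tau$ passed in are hypothetical values rather than parameters fitted on our datasets.

```python
import math
from statistics import NormalDist


def pln_quantile_bound(gamma, mu, tau):
    """Closed form exp(mu + sqrt(-2 tau^2 log(sqrt(2 pi) * gamma)))."""
    inner = -2.0 * tau ** 2 * math.log(math.sqrt(2.0 * math.pi) * gamma)
    if inner < 0.0:
        raise ValueError("gamma is too large for this approximation")
    return math.exp(mu + math.sqrt(inner))


def lognormal_tail_quantile(gamma, mu, tau):
    """Exact x such that P(X > x) = gamma for X ~ Lognormal(mu, tau)."""
    return math.exp(mu + tau * NormalDist().inv_cdf(1.0 - gamma))


if __name__ == "__main__":
    mu, tau = 5.0, 1.0  # hypothetical parameters
    for gamma in (0.01, 0.05, 0.10):
        bound = pln_quantile_bound(gamma, mu, tau)
        exact = lognormal_tail_quantile(gamma, mu, tau)
        print(f"gamma = {gamma}: bound = {bound:.1f}, lognormal = {exact:.1f}")
```

Dropping the logarithmic term in the derivation makes the closed form sit above the exact Lognormal quantile across $\gamma \in [0.01, 0.1]$, which is consistent with reading it as an upper bound.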

Next, we approximate the differences among the $\xi_\gamma$
quantiles between the PLN and the Power-law distributions.
Note that we are not approximating through a limit
formulation; instead, we estimate the difference between the
two analytical quantile values, i.e. PLN and Power-law, when
$\gamma \in [0.01, 0.1]$. In order to compute the difference
of the $\xi_\gamma$ estimated by the Power-law and the PLN,
we approximate the PLN's $\xi_\gamma$ as $e^{\mu}$, since
$\sqrt{-2\tau^2 \log(\sqrt{2\pi}\,\gamma)}$ is negligible
compared to $|\mu|$ when $\gamma$ is around its typical
value 0.05.

The Power-law overestimation can be computed as the ratio
of the $\xi_\gamma$ quantiles of the Power-law and the PLN.
Since the exponents of both quantile expressions are
approximated by the mean value, $\hat{\mu}$, of the sample
logarithms, the following theorem applies:

Theorem 1. For $\gamma \in [0.01, 0.1]$, the ratio of the
$\xi_\gamma$ quantiles of the Power-law and PLN is
$\approx \left(\frac{1}{e\,\gamma}\right)^{\hat{\mu}}$.
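The ratio can be spelled out in one line. The worked equation below assumes, as the text does, that both quantiles share the exponent $\hat{\mu}$, with base $1/\gamma$ for the Power-law and base $e$ for the PLN; the superscripts PL and PLN are labels we introduce here.

```latex
\frac{\xi_\gamma^{\mathrm{PL}}}{\xi_\gamma^{\mathrm{PLN}}}
  \approx \frac{(1/\gamma)^{\hat{\mu}}}{e^{\hat{\mu}}}
  = \left(\frac{1}{e\,\gamma}\right)^{\hat{\mu}},
\qquad
\gamma = 0.1 \;\Rightarrow\;
  \left(\frac{10}{e}\right)^{\hat{\mu}} \approx 3.68^{\hat{\mu}}
```

Under this reading, every unit increase in $\hat{\mu}$ multiplies the Power-law overestimation by roughly 3.68 at $\gamma = 0.1$.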



                        $\xi_\gamma$ (see Lemma 2)
Real Graphs        Power-law      PLN    Real Data
Monterey                8906      582          246
Santa Barbara         160362      845          345
Egypt                  29987      775          206
Los Angeles           170457     1101          363
New York              152937     1099          395
Manhattan R.W.        357804      797          583
London                176742      939          364
Orkut                  41377      369          155



Table 2: Comparing the min degree of the top 10% high-degree nodes 
on our datasets, against the values predicted by the theoretical bounds. 



Both the PLN and Power-law estimates express the
$\xi_\gamma$ degree as an exponential function, but they
have different bases. For the Power-law, the base of the
exponent is $1/\gamma = 10$; for PLN it is $e$. This
discrepancy accounts for the large difference in the
predictions made by the two models. As we will show using
experimental validation on real datasets, this difference
can be as large as two orders of magnitude.

Experimental Validation on Real Data. Now we aim to
validate the theoretical result shown in the previous
section using experimental analysis on our datasets. For
each dataset presented in Table 1, we compute its
$\xi_\gamma$ value (i.e. the estimated minimum node degree
within the 10% of the highest degree nodes), and compare it
against the analytical bounds from the PLN (see Lemma 2) and
Power-law models.

We list the results in Table 2. Clearly, our theoretical
approximations of $\xi_\gamma$ values using PLN are
consistently more accurate than those from the Power-law
model, which overestimates the real data by up to 3 orders
of magnitude. For instance, on the New York graph, the
$\xi_\gamma$ value from the real sample is 395; our
theoretical approximation from PLN is 1099; but the
Power-law predicts 152937.
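The size of each overestimation can be recomputed from the published thresholds. The snippet below copies the rows of Table 2 and divides each model's estimate by the measured value; only the variable and function names are ours.

```python
# Rows copied from Table 2: (Power-law estimate, PLN estimate, real value).
TABLE_2 = {
    "Monterey": (8906, 582, 246),
    "Santa Barbara": (160362, 845, 345),
    "Egypt": (29987, 775, 206),
    "Los Angeles": (170457, 1101, 363),
    "New York": (152937, 1099, 395),
    "Manhattan R.W.": (357804, 797, 583),
    "London": (176742, 939, 364),
    "Orkut": (41377, 369, 155),
}


def overestimation_factors(table):
    """Map each graph to (Power-law / real, PLN / real)."""
    return {
        name: (power_law / real, pln / real)
        for name, (power_law, pln, real) in table.items()
    }


if __name__ == "__main__":
    for name, (pl, pln) in overestimation_factors(TABLE_2).items():
        print(f"{name:15s} Power-law x{pl:7.1f}   PLN x{pln:4.1f}")
```

On these rows the PLN estimate stays within a factor of four of the measured threshold for every graph, while the Power-law estimate is off by a factor between roughly 36 and 614.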

4.3 Cardinality of High-degree Nodes 

We next use the distribution models to predict the cardi- 
nality of high-degree nodes in OSN graphs. The cardinality 
of high-degree nodes is commonly used to design social al- 
gorithms and protocols, and to evaluate their performance 
and complexity. In the following, we first analytically quan- 
tify this metric using both the Power-law and PLN mod- 
els, and then empirically validate our predictions using real 
datasets. 

Power-law model. Using the Power-law density function, we
can derive the number of high degree nodes by computing the
integral of the upper tail of the distribution. Let $\xi$
be the minimum node degree among the high degree nodes,
then the number of high degree nodes is approximated as:



$$N \int_{\xi}^{\infty} c\,x^{-\alpha}\,dx = cN \cdot \frac{1}{(\alpha-1)\,\xi^{\alpha-1}} \qquad (11)$$

where $c$ is the normalization constant, approximated by
$(\alpha - 1)\,d_0^{\alpha-1}$, and $d_0$ is the minimum node
degree in the network.

                     # of nodes with degree > $\xi_\gamma$
Real Graphs        Power-law       PLN    Real Data
Monterey                1843      1387         1385
Santa Barbara           6763      2442         2724
Egypt                  61170     30387        28319
Los Angeles           141980     54894        57182
New York              204540     77879        85581
Manhattan R.W.         76133     76133        72854
London                398529    156850       160050
Orkut                 761970    334900       307240

Table 3: Number of nodes with degree higher than
$\xi_\gamma$, with $\xi_\gamma$ estimated from the 10% of the
nodes, computed on the fitted models and the real data.

Figure 7: When predicting # of high degree nodes, results from the
Power-law and PLN models diverge up to 5 orders of magnitude.
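The Power-law estimate in (11) reduces to a one-line computation once the normalization constant is substituted. The sketch below assumes a continuous Power-law density and an infinite upper integration limit; the function and argument names are hypothetical.

```python
def power_law_tail_count(n_nodes, alpha, d_min, xi):
    """Expected number of nodes with degree >= xi under a Power-law.

    Computes N * integral_{xi}^{inf} c * x^(-alpha) dx with
    c = (alpha - 1) * d_min^(alpha - 1), the normalization used in the text.
    """
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for the tail to be integrable")
    if xi < d_min:
        raise ValueError("xi must be at least the minimum degree")
    c = (alpha - 1.0) * d_min ** (alpha - 1.0)
    return n_nodes * c * xi ** (1.0 - alpha) / (alpha - 1.0)
```

With this normalization the count simplifies to $N (d_0/\xi)^{\alpha-1}$, so for example $N = 1000$, $\alpha = 2$, $d_0 = 1$ and $\xi = 10$ yields 100 nodes.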

PLN model. As mentioned before, a high degree node has a
degree at least $\xi$, where $\xi$ can be computed as in
Section 4.2. Thus, the next lemma follows:

Lemma 3. Let $\xi$ be the minimum degree for a high degree
node, then the number of high degree nodes is approximately

$$\frac{N\,\sqrt{2}\,\tau\; e^{-(\log \xi - \mu)^2/(2\tau^2)}}{2\sqrt{\pi}\,(\log \xi - \mu)}$$

Using Lemma 1, the number of high degree nodes described by
the PLN distribution can be studied from the $\Phi^{c}(x)$
component of the CCDF of the PLN, where $x$ is
$\frac{\log \xi - \mu}{\tau}$. Thus this number becomes
$N\,\Phi^{c}\left(\frac{\log \xi - \mu}{\tau}\right)$. Using
the approximation in (9), we approximate the number of nodes
in the network with degree no less than $\xi$ by

$$\frac{N\,\sqrt{2}\,\tau\; e^{-(\log \xi - \mu)^2/(2\tau^2)}}{2\sqrt{\pi}\,(\log \xi - \mu)}$$



Comparing High-degree Node Cardinality and Connectivity.

We use two experiments to validate the quality of
predictions on high-degree nodes from PLN and Power-law.
First, we study the number of high-degree nodes in the
measured network datasets, and compare this number with that
estimated from the fitted models in Table 3. Specifically,
we study the total number of high-degree nodes, defined as
the nodes with degree higher than $\xi_\gamma$ generated
from the top 10% of the network as derived in Section 4.2.

Figure 8: Comparing the cumulative degree of high degree nodes
between the Pareto-Lognormal model and the Power-law reveals different
values up to 3 orders of magnitude.

Table 3 shows the cardinality of high-degree nodes in the
measured networks, as well as those predicted by the PLN
and Power-law models. Clearly, the PLN model provides a
tight approximation of the real dataset values, while
predictions made by the Power-law model generally lead to
more than 100% estimation error.
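The estimation errors can be recomputed from the rows of Table 3. The snippet below copies those rows and reports each model's relative error against the measured count; the names are ours.

```python
# Rows copied from Table 3: (Power-law count, PLN count, real count).
TABLE_3 = {
    "Monterey": (1843, 1387, 1385),
    "Santa Barbara": (6763, 2442, 2724),
    "Egypt": (61170, 30387, 28319),
    "Los Angeles": (141980, 54894, 57182),
    "New York": (204540, 77879, 85581),
    "Manhattan R.W.": (76133, 76133, 72854),
    "London": (398529, 156850, 160050),
    "Orkut": (761970, 334900, 307240),
}


def relative_errors(table):
    """Map each graph to (Power-law error, PLN error) relative to real data."""
    return {
        name: (abs(power_law - real) / real, abs(pln - real) / real)
        for name, (power_law, pln, real) in table.items()
    }


if __name__ == "__main__":
    for name, (pl_err, pln_err) in relative_errors(TABLE_3).items():
        print(f"{name:15s} Power-law {pl_err:6.1%}   PLN {pln_err:6.1%}")
```

On these rows the Power-law error exceeds 100% for six of the eight graphs, while the PLN error stays below 11% throughout.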

Empirical results like these can sometimes be biased be- 
cause of specific distributions in the real data. To eliminate 
any possible bias, we generate pure sample networks from 
the Pareto-Lognormal and the Power-law models, then com- 
pute and compare each model's predicted number of high 
degree nodes and their connectivity. 

The pure sample networks are generated as follows. For
the Power-law, we take the parameters of a fitted model on
a particular real sample, and generate a pure sample by
uniformly extracting $N$ numbers between 0 and 1. Then for
each extraction $x$, a node is generated whose degree is
computed from $x$ and $\alpha$, where $\alpha$ has the value
of the Power-law exponent fitted on a real dataset. For the
Pareto-Lognormal, we generate a pure sample as follows: let
$x$ be a random number uniformly extracted between 1 and
$N$, and let $y$ be a random number from the exponential
distribution with mean parameter $1/\beta$. Then the degree
is computed from $x$ and $y$, where $\beta$, $\mu$ and
$\tau$ are the parameters of the Pareto-Lognormal fitted on
the real dataset. In Figures 7 and 8 we compare the number
of high degree nodes and the sum of all node degrees
respectively. We do this for $N \approx 80$K, and use the
average of the fitted model values as the parameters of the
Power-law and the Pareto-Lognormal. Figure 7 shows the
divergence of these two models in estimating the high degree
nodes. In particular for nodes with degree > 2000, Power-law
generates 2 orders of magnitude more nodes than PLN.
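For readers who want to reproduce pure samples, the sketch below uses standard sampling recipes rather than our exact generator. It assumes a continuous Power-law sampled by inverse transform, and a Pareto-Lognormal built as a Lognormal($\mu$, $\tau$) value divided by $e^{y}$ with $y$ exponential of mean $1/\beta$, which matches the form of the CDF in (4). All function names and default values are hypothetical.

```python
import math
import random


def sample_power_law(n, alpha, d_min=1.0, rng=random):
    """Inverse-transform samples from a continuous Power-law, exponent alpha."""
    exponent = -1.0 / (alpha - 1.0)
    # 1 - rng.random() lies in (0, 1], so every sample is >= d_min.
    return [d_min * (1.0 - rng.random()) ** exponent for _ in range(n)]


def sample_pln(n, beta, mu, tau, rng=random):
    """Samples of exp(mu + tau * Z - y), Z standard normal, y ~ Exp(rate=beta)."""
    return [
        math.exp(mu + tau * rng.gauss(0.0, 1.0) - rng.expovariate(beta))
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    power_law = sample_power_law(80_000, alpha=2.5, rng=rng)
    pln = sample_pln(80_000, beta=2.0, mu=4.0, tau=1.0, rng=rng)
    print(sum(d > 2000 for d in power_law), sum(d > 2000 for d in pln))
```

Passing an explicit `random.Random` instance keeps runs repeatable, which matters when comparing tail counts that depend on a handful of extreme draws.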

5. IMPACT ON SOCIAL APPLICATIONS 

In the last section, we argued using both analytical predic- 
tions and empirical data that the Pareto-Lognormal model 
provides a much more accurate representation of node de- 
gree distributions in OSNs. In this section, we seek to bet- 
ter understand how the choice of degree distribution model 
impacts the performance of applications on social graphs.

Figure 9: Number of users remaining connected after deleting x% of the highest degree nodes from the graph. The reported results compare
partitioning effects on the synthetic PLN and Power-law graphs and a real graph. The original graphs are Facebook graphs representing users in
Santa Barbara CA, Los Angeles CA and London UK.



More precisely, for applications that run on social graphs, 
we wish to quantify just how much the choice of a degree 
distribution model alters their experimental results. 

When actual measured social graphs are not available,
social applications generally use synthetic graphs generated
using statistics extracted from real graphs [9]. To
highlight the differences when applications use either the
Pareto-Lognormal or the Power-law degree model, we take each
of our real Facebook social graphs, and generate two
synthetic versions of them: one assuming a PLN model for
degree distribution, and one assuming Power-law. We generate
these synthetic graphs using prior works [26, 7], which can
generate a synthetic graph with no self loops, given a
specified degree distribution and a given network size.

We present results from implementations of three important
applications on social graphs. They include: a) a graph
partition approach that replicates high degree nodes across
partitions (introduced in Section 1), b) influence
maximization on social graphs using three different
information spread models [9], and c) a link-privacy attack
on anonymized social graphs [17].

All results are computed on a local cluster of Dell Xeon 
servers with 24-32GBs of main memory. Memory constraints 
limited the size of graphs we used in our results. Using Ama- 
zon EC2 large memory machines, we verified that our ob- 
served trends hold for larger graphs such as the Orkut graph, 
but such computations were too slow and costly in resources 
for us to generate comprehensive graphs. 

5.1 Partitioning via Supernode Replication 

Efficiently answering queries on large graphs is a difficult
challenge faced by companies dealing with large network
datasets, e.g. Facebook, Zynga. Scalable solutions require
splitting the graph data across machines in computing
clusters. While some systems [30, 12] distribute nodes
randomly across a cluster, an ideal solution would
distribute the graph across the cluster as subgraphs with
minimal edges between them, which minimizes data
dependencies between machines and maximizes parallel query
processing.



Unfortunately, graph partitioning is known to be
NP-Complete [5], and social graphs are known to be very
dense graphs that do not partition well. Fortunately,
Power-law networks are known to be vulnerable to "targeted
attacks," i.e. they quickly fragment into disconnected
subgraphs when nodes with the highest degree are removed
[4]. Since social graphs are widely accepted as Power-law
graphs, we hypothesize we can easily partition a social
graph by first fragmenting it into subgraphs, through the
removal of a small number of supernodes. These supernodes
(and their edges) can be replicated to every machine in a
cluster. In addition to potentially partitioning highly
connected graphs, this approach is attractive because it
does not require the entire graph to be in memory, and can
thus efficiently run on extremely large social graphs.

We now look at the impact of running this partition scheme
both on real social graphs, and on synthetic graphs
generated using the Power-law and PLN synthetic models. We
take three of our social graphs, the Facebook regional
graphs from Santa Barbara, CA, USA and London, UK [39], and
the Orkut social graph [23], and create synthetic graphs
that match each of them in size and node degrees, based on
either the Power-law or the PLN model for degree
distributions. We generate synthetic graphs from degree
distributions using Newman's approach [26].

For each of these graphs, we test the effectiveness of our 
graph partitioning strategy, by incrementally removing nodes 
from the graph, starting with those nodes with the highest de- 
gree. After these super nodes are removed, we examine the 
remaining graph, and measure the size of the largest con- 
nected component as a function of the original full graph. 
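The fragmentation test can be sketched in a few lines of Python. The version below is a simplified reading of the experiment: it ranks nodes by degree once on the original graph, removes the top fraction in a single step, and reports the largest connected component as a fraction of the original node count. The adjacency-dictionary representation and the function name are our assumptions.

```python
from collections import deque


def largest_component_fraction(adjacency, removed_fraction):
    """Remove the highest-degree nodes, then measure the largest component.

    `adjacency` maps each node to the set of its neighbors (undirected).
    Returns the largest remaining component size / original node count.
    """
    if not adjacency:
        raise ValueError("graph must not be empty")
    ranked = sorted(adjacency, key=lambda v: len(adjacency[v]), reverse=True)
    removed = set(ranked[: int(removed_fraction * len(ranked))])
    seen = set(removed)  # removed nodes are never visited
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best / len(adjacency)
```

A star graph shows the extreme case: with one hub and nine leaves, removing the top 10% of nodes deletes the hub and leaves only isolated nodes, so the fraction drops from 1.0 to 0.1.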

We plot the results in Figure 9 for each of the original
graphs. The results are strongly consistent among our
datasets. Regardless of the original graph in question,
synthetic graphs produced using a Power-law distribution are
quickly fragmented. Removing just the top 3% of the highest
degree nodes reduces the largest connected component to only
40% of the original graph. In contrast, the real social
graphs and synthetic graphs from the Pareto-Lognormal
distribution are highly resistant to the fragmentation. Even
after removing 10% of the highest degree nodes, the graph
remains largely connected, and the biggest connected
component still contains 80% of all nodes. In this case,
relying on an inaccurate degree distribution model produces
results fundamentally different from the original graph.

Figure 10: Number of users influenced by a given number of seed nodes under three different influence dissemination models. Seed nodes are
sorted by degree in decreasing order.

5.2 Influence Maximization 

Social networks have proven to be exceptionally useful
tools for information dissemination and marketing, and are
used by companies and individuals to promote their ideas,
opinions and products. One critical problem of interest is
how to identify an initial set of users who can influence
the largest number of other users. This is known as the
influence maximization problem.

We examine the impact of degree distribution models on
the influence maximization problem. Prior works have
provided algorithms that use statistical methods to model
information dissemination over social links [9, 16]. We
consider three different models to spread information:
independent cascade, weighted cascade and linear threshold.
These models differ in how they compute the probability of
influencing a node in the graph. For example, independent
cascade assumes a node is influenced independently from
nearby nodes, while the probability of a node in weighted
cascade being influenced is a function of its degree. In
each case, heuristics provided by [2] allow us to compute
the number of users influenced by an initial set of "seed"
users. We seek to understand how the spread of influence in
these models is impacted by the structure of the network. We
take the Santa Barbara Facebook graph and its two synthetic
variants based on Power-law and PLN, and compute how many
users are influenced as the number of seed nodes increases
for each of the three influence models mentioned above. We
focus on results from the Santa Barbara network, since it is
the largest graph we can execute given the significant
memory footprint of code from [9]. Prior work has shown this
graph to be a representative social graph in all graph
metrics [33].
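As an illustration of the simplest of the three models, the sketch below simulates a single run of independent cascade from degree-ranked seeds. It is not the heuristic code used in our experiments: it assumes one uniform activation probability on every edge, and the function names are hypothetical.

```python
import random


def top_degree_seeds(adjacency, k):
    """Pick the k highest-degree nodes as seeds."""
    ranked = sorted(adjacency, key=lambda v: len(adjacency[v]), reverse=True)
    return ranked[:k]


def independent_cascade(adjacency, seeds, prob, rng=random):
    """Run one independent-cascade simulation and return the influenced set.

    Each newly influenced node gets exactly one chance to influence each
    of its not-yet-influenced neighbors, succeeding with probability `prob`.
    """
    active = set(seeds)
    frontier = list(active)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in adjacency[node]:
                if neighbor not in active and rng.random() < prob:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active
```

Because each run is random, the number of influenced users is normally averaged over many runs with different seeds of the random generator.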

Figure 10 shows the spread of information for each of the
three different models on the Santa Barbara network. For
each model, we compare the results on Santa Barbara with
results on its Power-law and PLN synthetic graphs. Our
results are consistent across all three influence models.
While the exact result differs for each particular influence
model, results from the PLN graph always closely follow
results from the real graph. In contrast, results on the
Power-law graphs always overestimate the number of
influenced users on the real graph, sometimes overestimating
by an order of magnitude. These results are further
confirmation that using an inaccurate degree distribution
model can introduce dramatic errors in application-level
experimental results.

Figure 11: The capability to spread an attack through the network
is highly dependent on assumptions of the underlying topology. The
attack is much more effective on a Power-law network than both the
real graph and a Pareto-Lognormal network.

5.3 Attacks on Link Privacy 

For our third application study, we focus on the problem
of link privacy in social graphs. Prior studies such as [17]
used experiments to quantify the impact of attacks to
disclose the presence of connections between social network
users. This example differs from our other application
studies in that the application results are simulated based
on analytical derivations that integrate a degree
distribution model.

Figure 12: The evolution process affects nodes independently of
their node degree.

[17] presents different approaches to disclose network
information by attacking particular nodes in the network.
The most effective attack strategy is "Highest," which tries
to compromise nodes by spreading the attack from high degree
nodes. Intuitively, because high degree nodes have many
incident edges, they are more likely to successfully spread
the attack across other nodes. The authors provide a
mathematical form to quantify the effect of this attack by
using a probabilistic approach that leverages the
probability distribution function of the network.¹ This
leads to a simple question: how much does the effectiveness
of this attack depend on the choice of network topology
model?

Quantifying the strength of this attack relies on a notion
of node coverage. A node is covered if and only if its 1-hop
neighbors are unequivocally known. This definition allows
[17] to present results where the strength of this attack is
quantified as the fraction $f$ of nodes whose entire
immediate neighborhood is known.
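The coverage notion can be made concrete on a small graph. The sketch below is a simplified reading and not the analytical form of the cited work: it assumes that compromising a node reveals exactly that node's own edge list, so a node counts as covered when it is compromised itself or when all of its neighbors are. The function name is hypothetical.

```python
def covered_fraction(adjacency, k):
    """Compromise the k highest-degree nodes and return the covered fraction.

    Assumption: a compromised node reveals only its own incident edges, so
    node v is covered if v is compromised or every neighbor of v is.
    Isolated nodes count as covered because they have no unknown edges.
    """
    if not adjacency:
        raise ValueError("graph must not be empty")
    ranked = sorted(adjacency, key=lambda v: len(adjacency[v]), reverse=True)
    compromised = set(ranked[:k])
    covered = sum(
        1
        for node, neighbors in adjacency.items()
        if node in compromised or all(n in compromised for n in neighbors)
    )
    return covered / len(adjacency)
```

On a star graph a single compromised hub covers every node, which mirrors the intuition that hub-heavy topologies make the "Highest" strategy look more effective.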

Let us begin by introducing the required variables to
repeat the experiment: $N$, $m$ and $d_0$ are respectively
the total number of nodes in the network, the sum of the
node degrees and the minimum degree. The variable
$D = \sum_{i=1}^{k} d(x_i)$ identifies the sum of the
degrees of the $k$ nodes from where the attack originates.
Theorem 3 in [17] states that if
$D = -\frac{\ln \epsilon_0}{d_0}\,2m$, then the disclosed
nodes after the attack are $N(1 - \epsilon - o(1))$.
$\epsilon$ is a variable that we can compute as follows:
let $\epsilon_0 = e^{-\frac{d_0}{2m}\sum_{i=1}^{k} d(x_i)}$,
which is needed to estimate the fraction of covered nodes.
We can derive $\epsilon$ as:

where $k_{max}$ is the maximum node degree, $d(x)$ is the
degree of $x$ and $f(x)$ is the density function of the
degree distribution. If we assume a Power-law degree
distribution, we substitute $f(x)$ with $c\,x^{-\alpha}$ and
in the case of the Pareto-Lognormal we use $f(x)$ as defined
in (1).

Figure 11 shows that results on a Power-law network
significantly overpredict the impact of attacks using the
"Highest" strategy. In contrast, results on the real social
graph closely match those from a synthetic graph following
the Pareto-Lognormal degree model. Clearly, applications
that use the Power-law model in their analytic derivations
and simulations are also significantly affected by their
choice of node degree distribution models.

6. TOWARDS A GENERATIVE MODEL 

In this section, we present preliminary results towards a
generative model able to reproduce a Pareto-Lognormal graph.
We also show using evidence in our datasets that the
formation of these graphs diverges from the hypothesized
preferential attachment-based scheme presented in prior
studies [37].

¹ This use of network topology models is also common in
problems involving information propagation or dissemination
in social networks.

Our goal is to provide an intuition of an algorithm that 
captures a lognormal multiplicative process, and accurately 
models the temporal evolution of online social networks. 
Unlike generative models that focus on reproducing a single
snapshot of a network [21], we focus on modeling the
evolutionary process that captures the network growth. While the
Power-law curve is generated by an iterative process follow- 
ing the preferential attachment rule, our PLN model may be 
explained by the following process: "A node joins the net- 
work through an introduction node, and builds connections 
within the local community; then it completes its growth by 
joining and growing in multiple other communities." To re- 
alize a generative model based on this intuition, we need 
a stochastic process derived from the PLN distribution that 
balances the Pareto and the Lognormal process to drive the 
growth rate of each node. 

Dynamic Graph Snapshots. To understand the growth 
of social networks, we perform 30 daily crawls of the San 
Francisco Facebook network during the month of November 
2009. We use the same crawling strategy used to capture 
other Facebook datasets [39]. We use these daily snapshot
graphs to help us understand the rate of network growth and 
how new edges form. 

Network Generative Models. Since our PLN model is itself a
combination of a Pareto distribution and a Lognormal
distribution, our search for an iterative process should
naturally integrate two theories. Starting with
"Preferential Attachment" [37], which produces a Power-law
distribution, we add the "Law of Proportional Effects" [13],
where the growth of the degree of a node, at discrete time
$t + 1$, is multiplicative and independent from its actual
size $X_t$. This contributes to a Lognormal distribution. We
study the correlation of growth and current degree using our
30 snapshot graphs of the San Francisco network. The results
in Figure 12 confirm that the degree growth of different
nodes is not influenced by their starting degree, meaning
that the growth is a constant independent of node degrees.
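The link between multiplicative growth and the Lognormal distribution can be seen in a short simulation. In the sketch below every node starts at size 1 and is multiplied at each step by an independent random factor that does not depend on its current size; the logarithm of the final size is then a sum of independent terms and is approximately normal. The Gaussian choice for the log growth rate and all parameter values are assumptions made for illustration.

```python
import math
import random


def multiplicative_growth(n_nodes, steps, mean_log_rate, sd_log_rate, rng=random):
    """Grow each node by an independent random factor at every step.

    The factor is exp(g) with g ~ Normal(mean_log_rate, sd_log_rate), drawn
    independently of the node's current size (proportional effect).
    """
    sizes = [1.0] * n_nodes
    for _ in range(steps):
        sizes = [
            size * math.exp(rng.gauss(mean_log_rate, sd_log_rate))
            for size in sizes
        ]
    return sizes


if __name__ == "__main__":
    rng = random.Random(1)
    final = multiplicative_growth(5000, steps=30, mean_log_rate=0.05,
                                  sd_log_rate=0.1, rng=rng)
    logs = [math.log(s) for s in final]
    print(f"mean of log sizes = {sum(logs) / len(logs):.3f}")
```

With 30 steps, echoing the 30 daily snapshots, the mean of the log sizes concentrates around 30 times the mean log growth rate.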

A Two-Phase Iterative Algorithm. We envision a two-phase
iterative algorithm that integrates fundamental properties
from the "Law of Proportional Effects" [13] and
"Preferential Attachment" [37]. The bimodal connotation is
meant to highlight the ability of our algorithm to integrate
the Power-law and Lognormal probability models.

Figure 13: The comparison between the PDFs of the real and
synthetic Santa Barbara network shows that our generative model
provides an accurate reproduction of the original network.

Our two-phase algorithm alternates between adding new
nodes to the network using a preferential attachment model,
and growing the connectivity among nodes using the law of
proportionate effects. Its two phases are driven by a
probability parameter $p$. We do not claim to provide a
formal derivation of the algorithm or its proof of
correctness. Both are beyond the scope of this paper. Here
we limit ourselves to preliminary results to validate our
underlying intuition.
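Since the algorithm is only described at a high level, the following sketch should be read as one hypothetical instance of the two-phase idea and not as the implementation evaluated here. With probability `p` it adds a node through preferential attachment; otherwise it picks an existing node uniformly at random and adds edges in proportion to that node's current degree. The `growth_rate` parameter, the uniform choice of partners and the tolerance of repeated edges are all assumptions of the sketch.

```python
import random


def two_phase_graph(steps, p, growth_rate=0.1, rng=random):
    """Hypothetical two-phase generator; returns (degree dict, edge list)."""
    edges = [(0, 1)]
    degree = {0: 1, 1: 1}
    endpoints = [0, 1]  # each node appears once per incident edge

    def add_edge(u, v):
        edges.append((u, v))
        degree[u] += 1
        degree[v] += 1
        endpoints.extend((u, v))

    for _ in range(steps):
        if rng.random() < p:
            # Phase 1: a new node attaches to a target chosen with
            # probability proportional to the target's degree.
            target = rng.choice(endpoints)
            new_node = len(degree)
            degree[new_node] = 0
            add_edge(new_node, target)
        else:
            # Phase 2: a uniformly chosen node grows by a number of
            # edges proportional to its current degree.
            node = rng.randrange(len(degree))
            for _ in range(max(1, round(growth_rate * degree[node]))):
                partner = rng.randrange(len(degree))
                if partner != node:  # skip self loops
                    add_edge(node, partner)
    return degree, edges
```

Choosing the growing node uniformly keeps the relative growth independent of the node's degree, which is the property suggested by the snapshot analysis; a fitted version would need to calibrate `p` and `growth_rate` against real data.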

Initial Validation. To evaluate our intuition about the
generative method, we implement a simple instance of the
"Two-Phase Iterative Algorithm" summarized above, and run a
structural comparison between the generated graphs and the
real social network graphs. We fit the parameters of this
algorithm by following the approach presented in [33].

Our goal is to show that our intuitive algorithm for
generating PLN distributions can generate synthetic graphs
that accurately reproduce the degree distribution of our
real data. In Figure 13, we show the similarity between the
PDF of node degrees from the real data, versus those from
the synthetic graph produced by our algorithm. The results
are consistent across all seven of our datasets, and we show
results only on the Santa Barbara network for brevity.²
Given these initial results, we are continuing to work on
formalizing the mathematical formulation of our generative
model and a more thorough analysis via larger, more complete
synthetic datasets.

7. RELATED WORK 

Social Networks. Historically, both online and offline
social networks have been explained through the seminal
Power-law model. Power-law is often described with the
"rich get richer" paradigm, which has been proven to hold
in real datasets across multiple disciplines, including
Internet router topology graphs, biological graphs [11, 31],
human mobility traces [8], etc.

² These synthetic graphs also show similar structural
characteristics to the original graphs, including clustering
coefficient, Knn, separation metrics, and betweenness
centrality.



One of the first OSN studies was conducted on the Club
Nexus website [1]. Later analytical studies attracted
attention for their large scale, including CyWorld, MySpace
and Orkut [2], YouTube, Flickr and LiveJournal [23], and the
most recent studies of Facebook [39] and Twitter [19]. More
recently, researchers have begun to investigate the temporal
properties of OSNs [18, 24, 22].

Preliminary analysis of OSN structures in these and other
studies has shown that the degree distribution does not
follow a pure Power-law distribution. As a result, followup
work proposed to segment these distributions and fit the
segmented pieces with distinct Power-law settings [14, 2].

Social Applications and Systems. We have shown that our
proposed PLN is statistically more accurate in describing
OSNs than the seminal Power-law model. We believe that many
social applications and protocols designed based on the
Power-law assumption need to be re-evaluated, especially
algorithms and protocols that rely on the population of high
degree nodes or their connectivity. Examples include
distributed resource replication strategies to minimize
routing delay and social search, epidemic dissemination
strategies to maximize information spread [9], landmark
selection strategies to accurately predict shortest paths in
graphs [28], community detection to improve social
recommendation systems, and social attack strategies [17].

8. CONCLUSION 

Degree distributions are incredibly important tools for study- 
ing and understanding the structure and formation of social 
networks. They give us insight into network structure, and 
are the foundations of generative models that model growth 
and network dynamics [36, 20, 21]. Finally, they are key
tools in the design and analysis of efficient algorithms for a 
number of challenging graph problems. 

Our work sheds light on an existing discrepancy between 
the commonly used Power-law model and real measurement 
data from today's online social networks. While most prior 
studies have ignored this error, we show that it is consis- 
tent across different communities and propose the Pareto- 
Lognormal distribution (PLN) as a more accurate alterna- 
tive. Our analysis shows that PLN significantly outperforms 
existing elementary models, and is the most accurate and 
efficient of all complex distribution models we studied. Fi- 
nally, we analytically quantify the magnitude of error reduc- 
tion achieved by moving from the Power-law to the PLN 
model, and confirm our analysis with empirical data.